## NYU Tandon School of Engineering Fall 2022, ECE 6913

## Homework Assignment 7

Instructor: Azeez Bhavnagarwala, email: ajb20@nyu.edu

## Course Assistants

Varadraj Kakodkar (vns2008), Kartikay Kaushik (kk4332), Siddhanth Iyer (si2152), Swarnashri Chandrashekar (sc8781), Karan Sheth (kk4332), Haotian Zheng (hz2687), Haoren Zhang (kk4332), Varun Kumar (vs2411)

<u>Homework Assignment 6</u> [released Friday November 11<sup>th</sup> 2022] [due Monday November 28<sup>th</sup> by 11:59PM]

You are allowed to discuss HW assignments with anyone. You are not allowed to share your solutions with other colleagues in the class. Please feel free to reach out to the Course Assistants or the Instructor during office hours or by appointment if you need any help with the HW. Please enter your responses in this Word document after you download it from NYU Classes. Please use the Brightspace portal to upload your completed HW.

1. Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 64-bit memory address references, given as word addresses.

0x02, 0xb3, 0x2a, 0x01, 0xbe, 0x57, 0xbf, 0x0d, 0xb6, 0x2b, 0xbc, 0xfd

- 1.1 For each of these references, identify the binary word address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list whether each reference is a hit or a miss, assuming the cache is initially empty.
- 1.2 For each of these references, identify the binary word address, the tag, the index, and the offset given a direct-mapped cache with two-word blocks and a total size of eight blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.
- 1.3 You are asked to optimize a cache design for the given references. There are three direct-mapped cache designs possible, all with a total of eight words of data:
- C1 has 1-word blocks,
- C2 has 2-word blocks, and
- C3 has 4-word blocks.

**2.** Section 5.3 shows the typical method to index a direct-mapped cache, specifically (Block address) modulo (Number of blocks in the cache). Assuming a 64-bit address and 1024 blocks in the cache, consider a different indexing function, specifically (Block address[63:54] XOR Block address[53:44]). Is it possible to use this to index a direct-mapped cache? If so, explain why and discuss any changes that might need to be made to the cache. If it is not possible, explain why.

**3.** For a direct-mapped cache design with a 64-bit address, the following bits of the address are used to access the cache.

| Tag   | Index | Offset |  |
|-------|-------|--------|--|
| 63–10 | 9–5   | 4-0    |  |

- 3.1 What is the cache block size (in words)?
- 3.2 How many blocks does the cache have?
- 3.3 What is the ratio between total bits required for such a cache implementation over the data storage bits?

Beginning from power on, the following byte-addressed cache references are recorded.

|     | Address |    |    |     |     |     |      |    |     |      |     |      |
|-----|---------|----|----|-----|-----|-----|------|----|-----|------|-----|------|
| Hex | 00      | 04 | 10 | 84  | E8  | A0  | 400  | 1E | 8C  | C1C  | B4  | 884  |
| Dec | 0       | 4  | 16 | 132 | 232 | 160 | 1024 | 30 | 140 | 3100 | 180 | 2180 |

- 3.4 For each reference, list (1) its tag, index, and offset, (2) whether it is a hit or a miss, and (3) which bytes were replaced (if any).
- 3.5 What is the hit ratio?
- *3.6* List the final state of the cache, with each valid entry represented as a record of *<index, tag, data>*. For example,

## <0,3, Mem[oxCoo]-Mem[oxC1F]>

**4.** Recall that we have two write policies and two write allocation policies, and their combinations can be implemented either in L1 or L2 cache. Assume the following choices for L1 and L2 caches:

| L1                                | L2                         |
|-----------------------------------|----------------------------|
| Write through, non-write allocate | Write back, write allocate |

- 4.1 Buffers are employed between different levels of memory hierarchy to reduce access latency. For this given configuration, list the possible buffers needed between L1 and L2 caches, as well as L2 cache and memory.
- 4.2 Describe the procedure of handling an L1 write-miss, considering the components involved and the possibility of replacing a dirty block.
- 4.3 For a multilevel exclusive cache configuration (a block can only reside in one of the L1 and L2 caches), describe the procedures of handling an L1 write-miss and an L1 readmiss, considering the components involved and the possibility of replacing a dirty block.

**5.** Consider the following program and cache behaviors.

| D | Data Reads per 1000 Instructions | Data Writes per 1000 Instructions | Instruction Cache Miss Rate | Data Cache Miss Rate | Block Size (bytes) |
|---|----------------------------------|-----------------------------------|-----------------------------|----------------------|--------------------|
| 2 | 250                              | 100                               | 0.30%                       | 2%                   | 64                 |

- 5.1 Suppose a CPU with a write-through, writeallocate cache achieves a CPI of 2. What are the read and write bandwidths (measured by bytes per cycle) between RAM and the cache? (Assume each miss generates a request for one block.)
- 5.2 For a write-back, write-allocate cache, assuming 30% of replaced data cache blocks are dirty, what are the read and write bandwidths needed for a CPI of 2?

**6.** Media applications that play audio or video files are part of a class of workloads called "streaming" workloads (i.e., they bring in large amounts of data but do not reuse much of it). Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following word address stream:

- 6.1 Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?
- 6.2 Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?
- 6.3 "Prefetching" is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data are found in the prefetch buffer, it is considered as a hit, moved into the cache, and the next cache block is prefetched. Assume a two-entry stream buffer; and, assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?

7. Cache block size (B) can affect both miss rate and miss latency. Assuming a machine with a base CPI of 1, and an average of 1.35 references (both instruction and data) per instruction, find the block size that minimizes the total miss latency given the following miss rates for various block sizes.

| 8: 4% | 16: 3% | 32: 2% | 64: 1.5% | 128: 1% |
|-------|--------|--------|----------|---------|
|-------|--------|--------|----------|---------|

- 7.1 What is the optimal block size for a miss latency of 20 ×B cycles?
- 7.2 What is the optimal block size for a miss latency of 24 +B cycles?
- 7.3 For constant miss latency, what is the optimal block size?

**8.** In this exercise, we will look at the different ways capacity affects overall performance. In general, cache access time is proportional to capacity. Assume that main memory accesses take 70 ns and that 36% of all instructions access data memory. The following table shows data for L1 caches attached to each of two processors, P1 and P2.

|    | L1 Size | L1 Miss Rate | L1 Hit Time |
|----|---------|--------------|-------------|
| P1 | 2 KiB   | 8.0%         | 0.66 ns     |
| P2 | 4 KiB   | 6.0%         | 0.90 ns     |

- 8.1 Assuming that the L1 hit time determines the cycle times for P1 and P2, what are their respective clock rates?
- 8.2 What is the Average Memory Access Time for P1 and P2 (in cycles)?
- 8.3 Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 and P2? Which processor is faster? (When we say a "base CPI of 1.0", we mean that instructions complete in one cycle, unless either the instruction access or the data access causes a cache miss.)

we will now consider the addition of an L2 cache to P1 (to presumably make up for its limited L1 cache capacity). Use the L1 cache capacities and hit times from the previous table when solving these problems. The L2 miss rate indicated is its local miss rate.

| L2 Size | L2 Miss Rate | L2 Hit Time |
|---------|--------------|-------------|
| 1 MiB   | 95%          | 5.62 ns     |

- 8.4 What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or worse with the L2 cache?
- 8.5 Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 with the addition of an L2 cache?
- 8.6 What would the L2 miss rate need to be in order for P1 with an L2 cache to be faster than P1 without an L2 cache?
- 8.7 What would the L2 miss rate need to be in order for P1 with an L2 cache to be faster than P2 without an L2 cache?

**9.** This exercise examines the effect of different cache designs, specifically comparing associative caches to the direct-mapped caches from *Section 5.4*. For these exercises, refer to the sequence of word address shown below.

```
0x03, 0xb4, 0x2b, 0x02, 0xbe, 0x58, 0xbf, 0x0e, 0x1f, 0xb5, 0xbf, 0xba, 0x2e, 0xce
```

- *9.1* Sketch the organization of a three-way set associative cache with two-word blocks and a total size of 48 words. Your sketch should have a style similar to *Figure 5.18*, but clearly show the width of the tag and data fields.
- 9.2 Trace the behavior of the cache from Exercise 9.1 Assume a true LRU replacement policy. For each reference, identify
  - the binary word address,
  - the tag,
  - the index,
  - the offset
  - whether the reference is a hit or a miss, and
  - which tags are in each way of the cache after the reference has been handled.
- *9.3* Sketch the organization of a fully associative cache with one-word blocks and a total size of eight words. Your sketch should have a style similar to Figure 5.18, but clearly show the width of the tag and data fields.
- 9.4 Trace the behavior of the cache from Exercise 9.3. Assume a true LRU replacement policy. For each reference, identify
  - the binary word address,
  - the tag,
  - the index,
  - the offset,
  - whether the reference is a hit or a miss
  - the contents of the cache after each reference has been handled.
- 9.5 Sketch the organization of a fully associative cache with two-word blocks and a total size of eight words. Your sketch should have a style similar to Figure 5.18, but clearly show the width of the tag and data fields.
- 9.6 Trace the behavior of the cache from Exercise 9.5. Assume an LRU replacement policy. For each reference, identify
  - the binary word address,
  - the tag,
  - the index,
  - the offset,
  - whether the reference is a hit or a miss,
  - the contents of the cache after each reference has been handled.

- 9.7 Repeat Exercise 9.6 using MRU (most recently used) replacement.
- 9.8 Repeat Exercise 9.6 using the optimal replacement policy (i.e., the one that gives the lowest miss rate).

**10.** Multilevel caching is an important technique to overcome the limited amount of space that a first-level cache can provide while still maintaining its speed. Consider a processor with the following parameters:

| Base CPI, No<br>Memory<br>Stalls | Processor<br>Speed | Main<br>Memory<br>Access<br>Time | First-Level Cache<br>Miss Rate per<br>Instruction" | Second-Level<br>Cache, Direct-<br>Mapped Speed | Miss Rate with Second-<br>Level Cache, Direct-<br>Mapped | Second-Level Cache,<br>Eight-Way Set<br>Associative Speed | Miss Rate with Second-Level<br>Cache, Eight-Way Set<br>Associative |
|----------------------------------|--------------------|----------------------------------|----------------------------------------------------|------------------------------------------------|----------------------------------------------------------|-----------------------------------------------------------|--------------------------------------------------------------------|
| 1.5                              | 2 GHz              | 100 ns                           | 7%                                                 | 12 cycles                                      | 3.5%                                                     | 28 cycles                                                 | 1.5%                                                               |

<sup>\*\*</sup>First Level Cache miss rate is per instruction. Assume the total number of L1 cache misses

(instruction and data combined) is equal to 7% of the number of instructions.

10.1 Calculate the CPI for the processor in the table using: 1) only a first-level cache, 2) a second-level direct mapped cache, and 3) a second-level eight-way set associative cache. How do these numbers change if main memory access time doubles? (Give each change as both an absolute CPI and a percent change.) Notice the extent to which an L2 cache can hide the effects of a slow memory.

10.2 It is possible to have an even greater cache hierarchy than two levels? Given the processor above with a second-level, direct-mapped cache, a designer wants to add a third-level cache that takes 50 cycles to access and will have a 13% miss rate. Would this provide better performance? In general, what are the advantages and disadvantages of adding a third-level cache?

10.3 In older processors, such as the Intel Pentium or Alpha 21264, the second level of cache was external (located on a different chip) from the main processor and the first-level cache. While this allowed for large second-level caches, the latency to access the cache was much higher, and the bandwidth was typically lower because the second-level cache ran at a lower frequency. Assume a 512 KiB off-chip second level cache has a miss rate of 4%. If each additional 512 KiB of cache lowered miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would the cache have to be to match the performance of the second-level direct-mapped cache listed above?